
Add check for malloc failure in libpq calls #5446

Merged: 1 commit into timescale:main on Mar 16, 2023

Conversation

@mkindahl (Contributor) commented Mar 15, 2023

The functions PQconndefaults and PQmakeEmptyPGresult call
malloc and can return NULL if they fail to allocate memory for the
defaults and the empty result, respectively. This was checked with an
Assert, but asserts are removed in production builds.

Replace the Assert with checks that generate an error in production
builds rather than dereferencing the pointer and causing a
crash.

Fixes #5447

@codecov bot commented Mar 15, 2023

Codecov Report

Merging #5446 (fc46c84) into main (790b322) will decrease coverage by 0.05%.
The diff coverage is 66.66%.

@@            Coverage Diff             @@
##             main    #5446      +/-   ##
==========================================
- Coverage   90.72%   90.68%   -0.05%     
==========================================
  Files         228      228              
  Lines       53093    53089       -4     
==========================================
- Hits        48171    48145      -26     
- Misses       4922     4944      +22     
Impacted Files Coverage Δ
src/utils.h 80.00% <ø> (ø)
tsl/src/remote/connection.c 88.86% <66.66%> (-0.10%) ⬇️

... and 6 files with indirect coverage changes


@mkindahl mkindahl force-pushed the libpq-checks branch 2 times, most recently from 1e2cddd to 09def5e Compare March 15, 2023 14:30
@mkindahl mkindahl marked this pull request as ready for review March 15, 2023 14:30
@mkindahl mkindahl changed the title Add check for malloc failure Add check for malloc failure in libpq calls Mar 16, 2023
-		/* probably OOM */
-		elog(ERROR, "could not get default libpq options");
-	}
+	Ensure(libpq_options, "out of memory");
Contributor:

This might be the most common error; as an improvement, I suggest adding a new macro like Ensure_oom (or something like that) that uses the ERRCODE_OUT_OF_MEMORY code underneath.

Contributor Author:

Good idea. Makes more sense than using ERRCODE_INTERNAL_ERROR.

Member:

Why not just ereport? Ensure is for internal program errors, and OOM is an environment error.

Contributor Author:


> Why not just ereport? Ensure is for internal program errors, and OOM is an environment error.

I think it is a matter of definitions. IMHO, Ensure is used when environmental issues can cause problems as well, such as bad catalog data.

No strong opinion on what to use here, but if using ERRCODE_OUT_OF_MEMORY it is simpler to replace with ereport.

Contributor:

Not sure why this is an improvement over the previous error. The pointer was already checked for NULL and the new error message is more generic than the previous one.

Contributor Author:

> Not sure why this is an improvement over the previous error. The pointer was already checked for NULL and the new error message is more generic than the previous one.

But AFAICT, it only returns NULL if it fails to allocate memory. I have no strong opinion, I can replace it with the original code.

Member:

> I think it is a matter of definitions.

Yes, it would be good to work out some guidelines for this. Being able to distinguish program logic errors is important for fuzzing, like SQLSmith, or some kinds of stress testing. Things like Assert and elog signal such errors: if you see them under any conditions, it's a bug. We could make SQLSmith more efficient by adding some handling for ERRCODE_INTERNAL_ERROR errors, and by using elog vs ereport in a more principled manner. In the context of testing, catalog corruption is probably also an internal error? Not sure...

Contributor Author:

> In the context of testing, catalog corruption is probably also an internal error? Not sure...

Oh, I think it is, but it can be caused by factors out of our control, so we should have checks for it and just not assume that it is always correct.

Contributor Author:

Added a macro TS_OOM_CHECK and used that instead. It calls ereport.

@erimatnor (Contributor) left a comment:

I don't think this is doing the right thing.

We should have these checks where we allocate the memory, and not where we use or pass the pointers around.

We also want to try to avoid throwing errors in the networking/connection code (there is actually an effort to get away from that), because errors fail the current transaction, which is not what you always want to do (although local memory pressure might be an exception). For example, if we have a function that runs a health check on all DN connections it will fail the healthcheck on all connections if only one of them experiences an error. Therefore, we should pass the error up to higher layers, which can then decide if throwing an error is appropriate.

@@ -397,7 +398,7 @@ handle_conn_destroy(PGEventConnDestroy *event)
 	unsigned int results_count = 0;
 	ListNode *curr;

-	Assert(NULL != conn);
+	Ensure(conn, "no instance data available");
@erimatnor (Contributor) commented Mar 16, 2023:

I think this really should be an Assert() because it checks something that shouldn't happen. PQinstanceData just literally gets a pointer from an array and will only fail if it is not found or a passed in pointer is NULL.

The check should be elsewhere, e.g., where the allocation of the instance data occurs. If we follow this pattern, we will just have unused error messages every time we reference a pointer. Instead, this error is just masking the real issue: that a connection wasn't properly allocated.

Contributor Author:


> I think this really should be an Assert() because it checks something that shouldn't happen. PQinstanceData just literally gets a pointer from an array and will only fail if it is not found or a passed in pointer is NULL.

Yes, and the point is that it is not entirely clear if we can have a situation where the pointer is not found. If we have, the server will crash in production if we only use an Assert. This is the idea of being defensive and rather error out than crash if we make a mistake.

> The check should be elsewhere, e.g., where the allocation of the instance data occurs. If we follow this pattern, we will just have unused error messages every time we reference a pointer.

Not sure I follow you. For this case we will have an error message instead of a crash; not every time we reference a pointer. Yes, it will mostly not trigger, but that is just code so I do not see the problem.

Anyway, I'll remove these defensive checks.

Contributor Author:


Restored the Assert.

@@ -441,7 +442,7 @@ handle_result_create(PGEventResultCreate *event)
 	TSConnection *conn = PQinstanceData(event->conn, eventproc);
 	ResultEntry *entry;

-	Assert(NULL != conn);
+	Ensure(conn, "no instance data available");
Contributor:


Same here, this should be an Assert(). The connection is allocated elsewhere and that's where we should have the check.

Contributor Author:


Same as above, we replace a crash in production builds with an error message. Not every time we reference the instance.

Contributor Author:


Restored the Assert.

@@ -452,7 +453,10 @@ handle_result_create(PGEventResultCreate *event)
 	entry->result = event->result;
 	/* Add entry as new head and set instance data */
 	list_insert_after(&entry->ln, &conn->results);
-	PQresultSetInstanceData(event->result, eventproc, entry);
+	if (!PQresultSetInstanceData(event->result, eventproc, entry))
Contributor:


This will only fail if you pass in bad pointers. We should have the checks where those pointers are set/allocated instead.

If this is really what we want to do, it would be better to wrap these functions in, e.g., a MACRO, and raise the error in the wrapper instead of doing it in every place they are called.

Contributor Author:


But this is the only place where we use that function, so it seems pointless to add a macro for that reason alone.

Contributor Author:


Removed this check.

@@ -1024,6 +1026,7 @@ remote_connection_get_result(const TSConnection *conn, TimestampTz endtime)
 	if (PQconsumeInput(conn->pg_conn) == 0)
 	{
 		pgres = PQmakeEmptyPGresult(conn->pg_conn, PGRES_FATAL_ERROR);
+		Ensure(pgres, "out of memory");
Contributor:


This check makes sense, as opposed to most of the other ones, since this one actually allocates memory.

@mkindahl (Contributor Author) left a comment:


I think we should be more defensive in our coding practices. The referenced bug demonstrates quite clearly that this is something we need to improve on.

I'll remove the defensive checks around the instance handling though.


@mkindahl (Contributor Author):

> We also want to try to avoid throwing errors in the networking/connection code (there is actually an effort to get away from that), because errors fail the current transaction, which is not what you always want to do (although local memory pressure might be an exception). For example, if we have a function that runs a health check on all DN connections it will fail the healthcheck on all connections if only one of them experiences an error. Therefore, we should pass the error up to higher layers, which can then decide if throwing an error is appropriate.

But all the checks that were added throw an error instead of crashing, and it is not trivial to catch a crashing process and handle it elsewhere in the code.

The functions `PQconndefaults` and `PQmakeEmptyPGresult` call
`malloc` and can return NULL if they fail to allocate memory for the
defaults and the empty result, respectively. This was checked with an
`Assert`, but asserts are removed in production builds.

Replace the `Assert` with checks that generate an error in production
builds rather than dereferencing the pointer and causing a crash.
@mkindahl mkindahl merged commit 67ff84e into timescale:main Mar 16, 2023
@mkindahl mkindahl deleted the libpq-checks branch March 16, 2023 13:20
@mkindahl mkindahl added this to the TimescaleDB 2.10.2 milestone Mar 29, 2023
akuzm added a commit to akuzm/timescaledb that referenced this pull request Apr 19, 2023
 ## 2.10.2 (2023-04-20)

**Bugfixes**
* timescale#5410 Fix file trailer handling in the COPY fetcher
* timescale#5446 Add checks for malloc failure in libpq calls
* timescale#5233 Out of on_proc_exit slots on guc license change
* timescale#5428 Use consistent snapshots when scanning metadata
* timescale#5499 Do not segfault on large histogram() parameters
* timescale#5470 Ensure superuser perms during copy/move chunk
* timescale#5500 Fix when no FROM clause in continuous aggregate definition
* timescale#5433 Fix join rte in CAggs with joins
* timescale#5556 Fix duplicated entries on timescaledb_experimental.policies view
* timescale#5462 Fix segfault after column drop on compressed table
* timescale#5543 Copy scheduled_jobs list before sorting it
* timescale#5497 Allow named time_bucket arguments in Cagg definition
* timescale#5544 Fix refresh from beginning of Continuous Aggregate with variable time bucket
* timescale#5558 Use regrole for job owner
* timescale#5542 Enable indexscan on uncompressed part of partially compressed chunks

**Thanks**
* @nikolaps for reporting an issue with the COPY fetcher
* @S-imo-n for reporting the issue on Background Worker Scheduler crash
* @geezhu for reporting issue on segfault in histogram()
* @mwahlhuetter for reporting the issue with joins in CAggs
* @mwahlhuetter for reporting issue with duplicated entries on timescaledb_experimental.policies view
* @H25E for reporting error refreshing from beginning of a Continuous Aggregate with variable time bucket
Successfully merging this pull request may close these issues.

[Bug]: OOM errors should not crash timescaledb extension